Cryptocurrency Price Prediction Using Deep Learning (LSTM)

Project Overview

This project aims to predict cryptocurrency prices from historical price data. We use a Long Short-Term Memory (LSTM) network, a type of recurrent neural network, and train it on roughly five years of daily cryptocurrency prices. The longer-term goal is a model that can accurately predict prices for the next 30 days; the notebook below takes a first step by predicting the next day's closing price from the previous 7 days. In addition to the model, we add visualizations to help us better understand the data and the results.

Importing the libraries

In [1]:
import pandas as pd
import numpy as np
from binance.client import Client
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

Importing data and preprocessing

In [2]:
# Variables
# -- You can change the crypto pair, the start date, and the time interval below --
client = Client()
pair_symbol = "BTCUSDT"
time_interval = Client.KLINE_INTERVAL_1DAY
start_date = "01 january 2017"

# Fetch data
klinesT = client.get_historical_klines(pair_symbol, time_interval, start_date)

# Create dataframe
df = pd.DataFrame(klinesT, columns = ['timestamp', 'open', 'high', 'low', 'close', 'volume', 'close_time', 'quote_av', 'trades', 'tb_base_av', 'tb_quote_av', 'ignore' ])

# Drop unnecessary columns
df.drop(['close_time', 'quote_av', 'trades', 'tb_base_av', 'tb_quote_av', 'ignore'], axis=1, inplace=True)

# Convert columns to numeric
for col in df.columns:
    df[col] = pd.to_numeric(df[col])

# Convert dates to datetime
df = df.set_index(df['timestamp'])
df.index = pd.to_datetime(df.index, unit='ms')
del df['timestamp']

# Split the df into a training set and a test set, and define the learning period (window length in days)
training_set = df.copy().loc["2017":"2021"]
test_set = df.copy().loc["2022":]
learn_period = 7

training_set
Out[2]:
timestamp       open      high       low     close        volume
2017-08-17   4261.48   4485.39   4200.74   4285.08    795.150377
2017-08-18   4285.08   4371.52   3938.77   4108.37   1199.888264
2017-08-19   4108.37   4184.69   3850.00   4139.98    381.309763
2017-08-20   4120.98   4211.08   4032.62   4086.29    467.083022
2017-08-21   4069.13   4119.62   3911.79   4016.00    691.743060
...              ...       ...       ...       ...           ...
2021-12-27  50775.48  52088.00  50449.00  50701.44  28792.215660
2021-12-28  50701.44  50704.05  47313.01  47543.74  45853.339240
2021-12-29  47543.74  48139.08  46096.99  46464.66  39498.870000
2021-12-30  46464.66  47900.00  45900.00  47120.87  30352.295690
2021-12-31  47120.88  48548.26  45678.00  46216.93  34937.997960

1598 rows × 5 columns
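The loading steps in cell 2 can be wrapped in a small function so they can be checked without calling the Binance API. The sketch below is only an illustration: `klines_to_frame` is a hypothetical helper, and the two sample rows reuse the first five values from the table above with placeholder values in the remaining fields.

```python
import pandas as pd

# Column layout used in cell 2 for the rows returned by get_historical_klines
KLINE_COLUMNS = ['timestamp', 'open', 'high', 'low', 'close', 'volume',
                 'close_time', 'quote_av', 'trades', 'tb_base_av',
                 'tb_quote_av', 'ignore']
PRICE_COLUMNS = ['open', 'high', 'low', 'close', 'volume']


def klines_to_frame(klines):
    """Turn raw kline rows into a numeric OHLCV frame indexed by date (hypothetical helper)."""
    frame = pd.DataFrame(klines, columns=KLINE_COLUMNS)
    # The timestamp is in milliseconds since the Unix epoch
    frame.index = pd.to_datetime(frame['timestamp'], unit='ms')
    frame.index.name = 'timestamp'
    # Keep only the price/volume fields and cast the string values to floats
    return frame[PRICE_COLUMNS].apply(pd.to_numeric)


# Sample rows: prices arrive as strings; the last six fields are placeholders
sample = [
    [1502928000000, "4261.48", "4485.39", "4200.74", "4285.08", "795.150377",
     1503014399999, "0", 0, "0", "0", "0"],
    [1503014400000, "4285.08", "4371.52", "3938.77", "4108.37", "1199.888264",
     1503100799999, "0", 0, "0", "0", "0"],
]
sample_df = klines_to_frame(sample)
print(sample_df)
```

Selecting the five price columns before the numeric conversion avoids converting fields that are dropped anyway.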

In [3]:
test_set
Out[3]:
timestamp       open      high       low     close       volume
2022-01-01  46216.93  47954.63  46208.37  47722.65  19604.46325
2022-01-02  47722.66  47990.00  46654.00  47286.18  18340.46040
2022-01-03  47286.18  47570.00  45696.00  46446.10  27662.07710
2022-01-04  46446.10  47557.54  45500.00  45832.01  35491.41360
2022-01-05  45832.01  47070.00  42500.00  43451.13  51784.11857
...              ...       ...       ...       ...          ...
2023-04-20  28797.10  29088.30  28010.00  28243.65  76879.09372
2023-04-21  28243.65  28374.02  27125.00  27262.84  77684.76790
2023-04-22  27262.84  27882.72  27140.35  27816.85  36023.69686
2023-04-23  27816.85  27816.85  27311.25  27590.60  34812.09581
2023-04-24  27590.59  28000.00  26942.82  27439.77  49925.47299

479 rows × 5 columns

In [4]:
# Scale the training set
sc = MinMaxScaler(feature_range=(0, 1))
training_set_scaled = sc.fit_transform(training_set["close"].values.reshape(-1, 1))

# Initialize the input and output arrays
X_train = []
y_train = []

# Create training set arrays
for i in range(learn_period, len(training_set)):
    # Append learn_period values of the scaled training set to X_train as inputs
    X_train.append(training_set_scaled[i-learn_period:i, 0])
    # Append the next value in the scaled training set to y_train as the output
    y_train.append(training_set_scaled[i, 0])

# Convert the arrays to numpy arrays
X_train, y_train = np.array(X_train), np.array(y_train)

# Reshape the input arrays to be compatible with LSTM model
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))

# Print the first two values of the first input array to check the format
print(X_train[0][0:2])
[[0.01703628]
 [0.01428964]]
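The windowing loop in cell 4 pairs each run of `learn_period` consecutive values with the value that follows it. A minimal sketch of the same idea on a toy series is shown below; `make_windows` is a hypothetical name, and the toy data is not from the notebook.

```python
import numpy as np


def make_windows(values, learn_period):
    """Build (samples, learn_period, 1) inputs and next-value targets (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    # Each input is the learn_period values that precede position i
    X = np.array([values[i - learn_period:i]
                  for i in range(learn_period, len(values))])
    # Each target is the value at position i
    y = values[learn_period:]
    # Add a trailing feature axis, as the LSTM layer expects 3D input
    return X.reshape(-1, learn_period, 1), y


toy = np.arange(10)  # 0, 1, ..., 9
X_toy, y_toy = make_windows(toy, learn_period=7)
print(X_toy.shape, y_toy.shape)  # (3, 7, 1) (3,)
print(X_toy[0].ravel(), "->", y_toy[0])
```

With 10 values and a window of 7, only 3 samples remain, which is why the training arrays are `learn_period` rows shorter than the training set.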
In [5]:
# Create a sequential model for predicting closing prices
lstm_model = Sequential()

# Add the first LSTM layer with 50 units, return_sequences=True, and input_shape as the shape of our training data
lstm_model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 1)))

# Add a Dropout layer with a rate of 0.2 to prevent overfitting
lstm_model.add(Dropout(0.2))

# Add another LSTM layer with 50 units and return_sequences=True
lstm_model.add(LSTM(units=50, return_sequences=True))
lstm_model.add(Dropout(0.2))

# Add another LSTM layer with 50 units and return_sequences=True
lstm_model.add(LSTM(units=50, return_sequences=True))
lstm_model.add(Dropout(0.2))

# Add a final LSTM layer with 50 units
lstm_model.add(LSTM(units=50))
lstm_model.add(Dropout(0.2))

# Add a Dense layer with 1 unit to produce a single output value
lstm_model.add(Dense(units=1))

# Compile the model using the Adam optimizer and mean squared error loss function
lstm_model.compile(optimizer='adam', loss='mean_squared_error')
 
# Fit the model on the training data for 50 epochs with a batch size of 32
lstm_model.fit(X_train, y_train, epochs=50, batch_size=32)
Epoch 1/50
50/50 [==============================] - 5s 7ms/step - loss: 0.0284
Epoch 2/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0031
Epoch 3/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0028
Epoch 4/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0026
Epoch 5/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0024
Epoch 6/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0026
Epoch 7/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0024
Epoch 8/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0025
Epoch 9/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0023
Epoch 10/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0026
Epoch 11/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0021
Epoch 12/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0023
Epoch 13/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0021
Epoch 14/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0021
Epoch 15/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0018
Epoch 16/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0021
Epoch 17/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0019
Epoch 18/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0022
Epoch 19/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0021
Epoch 20/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0022
Epoch 21/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0021
Epoch 22/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0022
Epoch 23/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0017
Epoch 24/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0017
Epoch 25/50
50/50 [==============================] - 0s 6ms/step - loss: 0.0018
Epoch 26/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0019
Epoch 27/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0018
Epoch 28/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0016
Epoch 29/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0018
Epoch 30/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0021
Epoch 31/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0015
Epoch 32/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0017
Epoch 33/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0016
Epoch 34/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0018
Epoch 35/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0018
Epoch 36/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0017
Epoch 37/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0015
Epoch 38/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0017
Epoch 39/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0016
Epoch 40/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0017
Epoch 41/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0015
Epoch 42/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0015
Epoch 43/50
50/50 [==============================] - 0s 6ms/step - loss: 0.0019
Epoch 44/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0015
Epoch 45/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0014
Epoch 46/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0017
Epoch 47/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0015
Epoch 48/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0015
Epoch 49/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0015
Epoch 50/50
50/50 [==============================] - 0s 7ms/step - loss: 0.0015
Out[5]:
<keras.callbacks.History at 0x1faee55a0e0>
In [6]:
# Get the actual closing prices from the test set (the model was trained on "close";
# column index 1 would select "high" instead)
real_stock_price = test_set[["close"]].values

# Combine the training and test sets to preprocess the inputs in the same way as the training set
dataset_total = pd.concat((training_set['close'], test_set['close']), axis = 0)

# Get the inputs from the combined dataset that correspond to the test set
inputs = dataset_total[len(dataset_total) - len(test_set) - learn_period:].values

# Reshape the inputs to have one column and apply feature scaling
inputs = inputs.reshape(-1,1)
inputs = sc.transform(inputs)

# Create the input data for the test set using the same sequence length as the training data
X_test = []
for i in range(learn_period, len(test_set) + learn_period):
    X_test.append(inputs[i-learn_period:i, 0])
X_test = np.array(X_test)

# Reshape the input data to fit the shape of the LSTM model input
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

# Use the trained LSTM model to predict the stock prices for the test set
predicted_stock_price = lstm_model.predict(X_test)

# Rescale the predicted stock prices back to the original scale
predicted_stock_price = sc.inverse_transform(predicted_stock_price)
15/15 [==============================] - 1s 2ms/step
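The scaler in cell 4 is fitted on the training closes only and then reused for the test inputs, so a test price above the training maximum maps to a value above 1. The sketch below reproduces the arithmetic of min-max scaling to the range (0, 1) with plain NumPy; the function names and the toy prices are assumptions for illustration, not values from the notebook.

```python
import numpy as np


def fit_min_max(train_values):
    """Return the minimum and maximum of the training data only."""
    train_values = np.asarray(train_values, dtype=float)
    return train_values.min(), train_values.max()


def scale(values, lo, hi):
    """Map values so that lo -> 0 and hi -> 1."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)


def unscale(scaled, lo, hi):
    """Invert scale(): map scaled values back to the original units."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo


train_close = np.array([100.0, 150.0, 200.0])  # toy training prices
test_close = np.array([180.0, 250.0])          # toy test prices

lo, hi = fit_min_max(train_close)
scaled_test = scale(test_close, lo, hi)
print(scaled_test)  # [0.8 1.5]
print(unscale(scaled_test, lo, hi))
```

Fitting on the training data alone keeps information about the test period out of the preprocessing step, at the cost of scaled inputs that can fall outside the range seen during training.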
In [7]:
# Create an empty figure
fig = go.Figure()

# Add the actual test-set prices as a line trace (the x values must be the test-set dates)
fig.add_trace(go.Scatter(x=test_set.index, y=real_stock_price.flatten(), mode='lines', name='Actual Price'))

# Add the predicted prices as a line trace over the same dates
fig.add_trace(go.Scatter(x=test_set.index, y=predicted_stock_price.flatten(), mode='lines', name='Predicted Price'))

# Set the figure's title and axis labels
fig.update_layout(title='Price Prediction', xaxis_title='Time (in periods)', yaxis_title='Price')

# Show the figure
fig.show()
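The predictions plotted above are one-step forecasts: each uses the previous 7 actual closes. One common way to reach the 30-day horizon mentioned in the overview is recursive forecasting, where each prediction is fed back in as the newest input. The sketch below assumes this approach; `recursive_forecast` and `predict_next` are hypothetical names, and a simple stand-in predictor replaces the trained network so the example runs on its own.

```python
import numpy as np


def recursive_forecast(last_window, predict_next, horizon=30):
    """Roll a one-step predictor forward `horizon` steps (hypothetical helper)."""
    window = list(np.asarray(last_window, dtype=float))
    forecast = []
    for _ in range(horizon):
        # Predict one step ahead from the current window
        next_value = float(predict_next(np.array(window)))
        forecast.append(next_value)
        # Drop the oldest value and append the prediction as the newest input
        window = window[1:] + [next_value]
    return np.array(forecast)


def mean_of_window(window):
    """Stand-in one-step predictor: the average of the window."""
    return window.mean()


last_week = [0.50, 0.52, 0.51, 0.53, 0.55, 0.54, 0.56]  # toy scaled closes
future = recursive_forecast(last_week, mean_of_window, horizon=30)
print(future.shape)  # (30,)
```

With the notebook's objects, `predict_next` could wrap something like `lstm_model.predict(w.reshape(1, -1, 1), verbose=0)[0, 0]` on scaled values, followed by `sc.inverse_transform` on the result; this has not been run here. Errors compound in recursive forecasts, because later steps depend on earlier predictions rather than on observed prices.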
In [8]:
# Define colors for predicted and real prices
pred_color = 'orange'
real_color = 'blue'

# Create scatter plot of predicted vs actual prices
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(predicted_stock_price, real_stock_price, alpha=0.5, s=80, color=pred_color, label='Predicted Prices')
ax.scatter(real_stock_price, real_stock_price, alpha=0.5, s=80, color=real_color, label='Real Prices')

# Add labels and title
ax.set_xlabel('Predicted Price', fontsize=14)
ax.set_ylabel('Actual Price', fontsize=14)
ax.set_title('Predicted vs Actual Prices', fontsize=16)

# Add grid and adjust axis limits
ax.grid(True, alpha=0.2)
ax.set_xlim([predicted_stock_price.min(), predicted_stock_price.max()])
ax.set_ylim([real_stock_price.min(), real_stock_price.max()])

# Add legend
ax.legend(loc='upper left', fontsize=12)

# Show the plot
plt.show()
In [9]:
# Initialize the "predicted" column with 0.0 (float, so it can hold prices)
df["predicted"] = 0.0

# Replace the last n values in the "predicted" column with the predicted values,
# where n is the length of the predicted_stock_price array.
# A single .iloc call on df avoids chained assignment, which may not update df.
n_pred = len(predicted_stock_price.flatten())
df.iloc[-n_pred:, df.columns.get_loc("predicted")] = predicted_stock_price.flatten()

# Calculate the difference between consecutive values in the "close" column
df["diff_close"] = df["close"].diff()

# Calculate the difference between consecutive values in the "predicted" column
df["diff_predicted"] = df["predicted"].diff()

# Shift the values in the "predicted" column up by 1
df["next_predicted"] = df["predicted"].shift(-1)

# Shift the values in the "close" column up by 1
df["next_close"] = df["close"].shift(-1)

# Calculate the difference between the next value in the "predicted" column and the current value
df["diff_predicted_next"] = df["next_predicted"] - df["predicted"]

# Calculate the difference between the next value in the "close" column and the current value
df["diff_close_next"] = df["next_close"] - df["close"]

# Calculate the mean evolution of the close price over the next 3, 5, 10, and 20 time steps
df["mean_evol_3"] = df["close"].shift(-3).rolling(3).mean() - df["close"]
df["mean_evol_5"] = df["close"].shift(-5).rolling(5).mean() - df["close"]
df["mean_evol_10"] = df["close"].shift(-10).rolling(10).mean() - df["close"]
df["mean_evol_20"] = df["close"].shift(-20).rolling(20).mean() - df["close"]

# Print the df
df
Out[9]:
open high low close volume predicted diff_close diff_predicted next_predicted next_close diff_predicted_next diff_close_next mean_evol_3 mean_evol_5 mean_evol_10 mean_evol_20
timestamp
2017-08-17 4261.48 4485.39 4200.74 4285.08 795.150377 0.000000 NaN NaN 0.000000 4108.37 0.000000 -176.71 NaN NaN NaN NaN
2017-08-18 4285.08 4371.52 3938.77 4108.37 1199.888264 0.000000 -176.71 0.000000 0.000000 4139.98 0.000000 31.61 NaN NaN NaN NaN
2017-08-19 4108.37 4184.69 3850.00 4139.98 381.309763 0.000000 31.61 0.000000 0.000000 4086.29 0.000000 -53.69 -92.550000 NaN NaN NaN
2017-08-20 4120.98 4211.08 4032.62 4086.29 467.083022 0.000000 -53.69 0.000000 0.000000 4016.00 0.000000 -70.29 -29.620000 NaN NaN NaN
2017-08-21 4069.13 4119.62 3911.79 4016.00 691.743060 0.000000 -70.29 0.000000 0.000000 4040.00 0.000000 24.00 140.673333 201.628 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2023-04-20 28797.10 29088.30 28010.00 28243.65 76879.093720 32297.560547 -553.45 -87.152344 32029.292969 27262.84 -268.267578 -980.81 -686.886667 NaN NaN NaN
2023-04-21 28243.65 28374.02 27125.00 27262.84 77684.767900 32029.292969 -980.81 -268.267578 31598.072266 27816.85 -431.220703 554.01 352.900000 NaN NaN NaN
2023-04-22 27262.84 27882.72 27140.35 27816.85 36023.696860 31598.072266 554.01 -431.220703 31109.390625 27590.60 -488.681641 -226.25 NaN NaN NaN NaN
2023-04-23 27816.85 27816.85 27311.25 27590.60 34812.095810 31109.390625 -226.25 -488.681641 30606.078125 27439.77 -503.312500 -150.83 NaN NaN NaN NaN
2023-04-24 27590.59 28000.00 26942.82 27439.77 49925.472990 30606.078125 -150.83 -503.312500 NaN NaN NaN NaN NaN NaN NaN NaN

2077 rows × 16 columns
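The `shift(-n).rolling(n).mean()` expression used for the `mean_evol_*` columns gives, at each row, the average of the next `n` closes; subtracting the current close yields the mean change over that horizon. The sketch below checks this on a toy series; `forward_mean_change` is a hypothetical name and the prices are made up.

```python
import pandas as pd


def forward_mean_change(close, n):
    """Mean of the next n values minus the current value (hypothetical helper)."""
    # shift(-n) moves the value from n rows ahead onto the current row;
    # rolling(n) then averages the values from 1 to n rows ahead
    return close.shift(-n).rolling(n).mean() - close


toy_close = pd.Series([10.0, 11.0, 13.0, 12.0, 15.0, 14.0, 16.0])
evol_3 = forward_mean_change(toy_close, 3)
print(evol_3)

# Manual check for row 2: mean of rows 3, 4, 5 minus the value at row 2
manual = toy_close.iloc[3:6].mean() - toy_close.iloc[2]
print(manual)
```

The first `n - 1` rows are NaN because the rolling window is not yet full, and the last `n` rows are NaN because no future values exist. This matches the table above, where `mean_evol_3` first appears on the third row.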

In [10]:
# Create an empty figure
fig = go.Figure()

# Add the real stock price data to the figure as a scatter trace
fig.add_trace(go.Scatter(x=df.index, y=df['close'], mode='lines', name='Actual Price'))

# Add the predicted stock price data to the figure as a scatter trace
fig.add_trace(go.Scatter(x=df.index, y=df['predicted'], mode='lines', name='Predicted Price'))

# Set the figure's title and axis labels
fig.update_layout(title='Price Prediction', xaxis_title='Time (in periods)', yaxis_title='Price')

# Update x-axis range to show data only from Jan 2022 to end of index
fig.update_xaxes(range=['2022-01-02', df.index[-1]])

# Show the figure
fig.show()
In [11]:
# Filter data using boolean masks
mask1 = (df["diff_predicted_next"] > df["diff_close"])
mask2 = (df["diff_predicted_next"] < df["diff_close"])

# Calculate metrics using masks and .loc/.mean()
print(len(df.loc[mask1 & (df["diff_close_next"] > 0)]))
print(len(df.loc[mask1 & (df["diff_close_next"] < 0)]))
print("-------------")
print(len(df.loc[mask2 & (df["diff_close_next"] < 0)]))
print(len(df.loc[mask2 & (df["diff_close_next"] > 0)]))
print("-------------")
print(df.loc[mask1, "diff_close_next"].mean())
print(df.loc[mask1, ["mean_evol_3", "mean_evol_5", "mean_evol_10", "mean_evol_20"]].mean())
print("-------------")
print(df.loc[mask2, "diff_close_next"].mean())
print(df.loc[mask2, ["mean_evol_3", "mean_evol_5", "mean_evol_10", "mean_evol_20"]].mean())
553
446
-------------
561
515
-------------
25.565735735735682
mean_evol_3     -1.501329
mean_evol_5     -4.680898
mean_evol_10    15.155755
mean_evol_20    44.893676
dtype: float64
-------------
-2.0527602230482818
mean_evol_3      45.300428
mean_evol_5      70.481616
mean_evol_10    113.828653
mean_evol_20    205.701012
dtype: float64

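These results suggest that the algorithm has some predictive power for the next day's move. When the predicted next-day change was greater than the most recent actual change (the first condition), the price rose the following day in 553 cases and fell in 446. When it was smaller (the second condition), the price fell in 561 cases and rose in 515. The mean next-day change is also higher under the first condition (25.57) than under the second (-2.05). The multi-day averages point the other way, however: the mean change over the next 3, 5, 10, and 20 days is larger under the second condition (for example, 205.70 versus 44.89 at 20 days), so the signal does not appear to persist beyond one day. The four counts sum to 2,075 of the 2,077 rows, so the masks also cover the 2017-2021 training period, where the "predicted" column is set to 0, and the figures therefore reflect more than the model's out-of-sample predictions.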
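One way to check predictive power more directly is to restrict the evaluation to the test period and compare the model with a naive baseline that predicts tomorrow's close to be today's close. The sketch below assumes that setup; `evaluate_forecast` is a hypothetical helper, and the toy numbers are not results from this notebook.

```python
import numpy as np


def evaluate_forecast(actual, predicted):
    """Compare one-step predictions with a 'no change' baseline (hypothetical helper)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)

    prev_close = actual[:-1]    # close on day t-1
    target = actual[1:]         # close on day t
    model_pred = predicted[1:]  # model's prediction for day t

    # Root mean squared error of the model and of the naive baseline
    rmse_model = np.sqrt(np.mean((model_pred - target) ** 2))
    rmse_naive = np.sqrt(np.mean((prev_close - target) ** 2))

    # Share of days where the predicted direction matches the actual direction
    actual_dir = np.sign(target - prev_close)
    pred_dir = np.sign(model_pred - prev_close)
    hit_rate = np.mean(actual_dir == pred_dir)

    return {"rmse_model": rmse_model, "rmse_naive": rmse_naive, "hit_rate": hit_rate}


toy_actual = [100.0, 102.0, 101.0, 105.0, 104.0]
toy_pred = [99.0, 101.0, 102.5, 103.0, 104.5]
scores = evaluate_forecast(toy_actual, toy_pred)
print(scores)
```

With the notebook's objects, the call would be along the lines of `evaluate_forecast(test_set["close"].values, predicted_stock_price.flatten())`. A model whose RMSE is not below the naive baseline's, or whose hit rate is close to 0.5, adds little over assuming the price stays the same.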

The model shows some predictive power, but further analysis is needed to evaluate its effectiveness fully. Cryptocurrency price prediction is a challenging task, and the model's predictions should be used with caution in investment decisions. Nonetheless, this project provides a good foundation for future research and development in this area.